
Computational and Structural Biotechnology Journal

American Association for the Advancement of Science (AAAS)

Preprints posted in the last 7 days, ranked by how well they match Computational and Structural Biotechnology Journal's content profile, based on 216 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit.

1
Molecular basis of Salla Disease: R39C Mutation Effects on the Lysosomal Transporter Sialin

Matsingos, C.; Lot, I.; Vaz, M.; Mailliart, J.; Boulayat, M.; Debacker, C.; Goupil-Lamy, A.; Gasnier, B.; Acher, F. C.; Anne, C.

2026-04-22 biochemistry 10.64898/2026.04.20.719580 medRxiv
Top 0.1%
23.7%

Salla disease is caused by a genetic mutation in sialin, a lysosomal membrane transporter, which exports sialic acid from lysosomes. Substrate translocation occurs via a rocker-switch mechanism that alternately exposes the substrate-binding site to the lysosomal lumen and the cytosol. The pathogenic mutation R39C found in most Salla disease patients decreases the lysosomal localisation and the transport activity. In this study, we used computational and mutagenesis approaches to elucidate the molecular effects of the R39C mutation. Using three-dimensional models of human sialin in the lumen-open (LO) and cytosol-open (CO) states combined with the mutagenesis of selected residues, we identify a critical "triplet" motif comprising R39, E194, and E262, which is associated with an ionic lock formed between K197 and D350 in the LO conformation. Molecular dynamics simulations suggest that the electrostatic triplet negatively modulates the ionic lock, and are consistent with a strengthened ionic lock in R39C sialin, potentially favouring the LO state. To assess the global effects of the R39C mutation, we computed dynamic cross-correlation matrices and identified correlation patterns consistent with an allosteric coupling between the ionic lock K197/D350 and the region surrounding the sialic acid binding site in wild-type sialin, whereas in the LO state of R39C sialin, this communication preferentially bypasses this region. Therefore, the R39C mutation may impede the LO to CO conformational transition required for sialic acid transport, providing a plausible mechanistic framework for the decreased transport activity, and possibly the decreased lysosomal localisation, observed in Salla disease. 
Highlights:
- The R39 residue participates in an interaction triplet, which negatively regulates an ionic lock stabilising the lumen-open conformation.
- The R39C mutation is associated with a stronger ionic lock in the simulations, and may favour the lumen-open state.
- Correlation network analysis suggests an allosteric coupling between the ionic lock and the region surrounding the sialic acid binding site.
- The R39C mutation alters the inferred allosteric coupling between the ionic lock and the region surrounding the sialic acid binding site.

[Graphical abstract: Figure 1]
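The dynamic cross-correlation matrices mentioned in this abstract are commonly computed as the normalized covariance of atomic displacements, C_ij = <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>). The sketch below illustrates that standard definition only. The array layout, the function name `dccm`, and the assumption that the trajectory is already aligned are illustrative choices, not details taken from the preprint.

```python
import numpy as np

def dccm(traj):
    """Dynamic cross-correlation matrix for an aligned trajectory.

    traj: array of shape (n_frames, n_atoms, 3), assumed to be already
    superimposed on a reference structure (hypothetical input layout).
    """
    traj = np.asarray(traj, dtype=float)
    disp = traj - traj.mean(axis=0)  # displacement of each atom from its mean position
    # Frame-averaged dot product of displacement vectors for every atom pair
    cov = np.einsum("fix,fjx->ij", disp, disp) / traj.shape[0]
    norm = np.sqrt(np.diag(cov))     # root-mean-square fluctuation per atom
    return cov / np.outer(norm, norm)
```

Values near +1 indicate correlated motion, values near -1 anti-correlated motion. An atom with zero fluctuation would cause a division by zero, which this sketch does not handle.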

2
GNOMES: an integrated framework for genome-wide normalization and differential binding analysis of CUT&RUN and ChIP-seq data

Roule, T.; Akizu, N.

2026-04-21 bioinformatics 10.64898/2026.04.16.718722 medRxiv
Top 0.2%
8.5%

Background: Despite their widespread use, quantitative comparison of epigenomic datasets such as ChIP-seq and CUT&RUN remains challenging, particularly due to difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient because signal-to-noise ratios vary widely across samples, even within the same experiment. While exogenous spike-in normalization can address some issues, robust spike-in controls are not always available, and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding analysis are typically performed using separate bioinformatics tools. Indeed, most differential analysis frameworks operate on raw count matrices, preventing users from visually inspecting normalized signal tracks and evaluating how normalization influences the results. To overcome these challenges, we developed GNOMES (Genome-wide NOrmalization of Mapped Epigenomic Signals), a framework that integrates signal normalization, quality control, and differential binding analysis within a unified workflow. Results: GNOMES is a user-friendly tool able to process ChIP-seq and CUT&RUN datasets from aligned reads, and generate normalized coverage profiles and differential binding results. The tool implements a robust genome-wide normalization strategy based on percentile scaling of signal local maxima, enabling stable normalization between biological replicates and conditions. GNOMES supports both single- and paired-end sequencing, does not require a negative control (input or IgG), and can be applied to both broad (histone marks) and narrow (transcription factor) enrichment patterns. The workflow includes normalization, optional consensus peak identification, and differential binding analysis.
For each step, GNOMES generates extensive quality-control metrics and visual outputs, including normalized bigWig tracks, median signal tracks, BED files of regions with significant changes, and diagnostic plots such as heatmaps and PCA. GNOMES is highly configurable and integrates established tools such as MACS2 to identify candidate peak regions for differential binding analysis, as well as DESeq2 and edgeR for statistical testing. Finally, GNOMES is organism-agnostic and can be applied to epigenomic datasets from any model system. Conclusions: GNOMES provides an integrated and highly customizable environment for normalization and differential binding analysis of epigenomic sequencing data. By integrating signal normalization with downstream statistical methods for differential binding analysis and comprehensive quality control, GNOMES simplifies the analysis of ChIP-seq and CUT&RUN datasets for the identification of chromatin changes.
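The idea of "percentile scaling of signal local maxima" can be illustrated with a small sketch. This is not the GNOMES implementation: the local-maximum definition, the 99th percentile default, and the function names `local_maxima` and `percentile_scale` are all assumptions made for illustration.

```python
import numpy as np

def local_maxima(signal):
    """Return values of interior points higher than the left neighbour
    and at least as high as the right neighbour (assumed definition)."""
    s = np.asarray(signal, dtype=float)
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:])
    return s[1:-1][is_peak]

def percentile_scale(signal, percentile=99.0, target=1.0):
    """Rescale a coverage track so that the chosen percentile of its
    local maxima equals `target` (percentile value is hypothetical)."""
    s = np.asarray(signal, dtype=float)
    peaks = local_maxima(s)
    if peaks.size == 0:
        raise ValueError("no local maxima found in signal")
    ref = np.percentile(peaks, percentile)
    if ref <= 0:
        raise ValueError("reference percentile must be positive")
    return s * (target / ref)
```

Under these assumptions, two tracks that differ only by a global scale factor (for example, sequencing depth) map to the same normalized profile, because the reference value scales with the signal.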

3
Network-Based Functional Fragility Reveals System-Level Reorganization Of The Gut Microbiome In Inflammatory Bowel Disease

Kenavdekar, M. V.; Natarajan, E.

2026-04-21 bioinformatics 10.64898/2026.04.16.719113 medRxiv
Top 0.3%
8.2%

The human gut microbiome plays a critical role in host health, yet its functional organization in disease remains poorly understood. Most studies focus on taxonomic composition or pathway abundance, which fail to capture higher-order interactions governing system-level behavior. Here, we investigated microbiome functional organization in inflammatory bowel disease (IBD), including Crohn's disease (CD), ulcerative colitis (UC), and healthy controls (HC), using a network-based framework across 60 metagenomic samples. Functional pathway profiles were used to construct correlation-based interaction networks, followed by analysis of network topology, functional redundancy, keystone pathway architecture, and system robustness. Disease-associated networks (CD and UC) exhibited reduced global connectivity, increased modular fragmentation, and centralization of keystone pathways, indicating a shift from distributed organization to more fragmented and fragile network structures compared to healthy controls. Notably, machine learning models demonstrated that network-derived features achieved higher classification performance (accuracy up to 0.824) compared to redundancy-based measures. These findings reveal that microbiome dysfunction in IBD is driven by large-scale reorganization of functional interaction networks rather than loss of functional capacity. This study highlights the importance of network-level analysis in understanding microbiome-associated disease and provides a systems-level framework for future research.

4
Unraveling the potential of short and long read sequencing for human genome profiling

Leduc, A.; Bachr, A.; Sandron, F.; Delepine, M.; Delafoy, D.; Fund, C.; Daviaud, C.; Meslage, S.; Turon, V.; Bacq-Daian, D.; Rousseau, F.; Olaso, R.; Deleuze, J.-F.; Gerber, Z.; Meyer, V.

2026-04-22 genomics 10.64898/2026.04.20.719568 medRxiv
Top 0.7%
6.2%

Background: Short read sequencing technologies have dominated the field of human whole genome sequencing in the past years in terms of cost, throughput, and accuracy. However, thanks to recent technological evolution, long read approaches have become increasingly competitive and complementary to short reads. With the gap in the cost per genome closing slowly between both approaches, long reads might replace short read sequencing in future research and clinical applications. Still, comprehensive evaluation is necessary to conclude on the performance and general advantages of each technology. Results: In this study, we compared the latest chemistries of major suppliers of short and long read technologies: Illumina short reads, Illumina Complete Long Reads (ICLR), Pacific Biosciences HiFi reads (PacBio), and Oxford Nanopore Technologies long reads (ONT). Using the HG002 human reference sample and established bioinformatics guidelines, we assessed their variant calling performance against the latest available truth sets at different levels of coverage. For single nucleotide variant detection, all technologies were equivalent. Despite the latest improvements in chemistry, indel calling with ONT continues to lag in accuracy behind other technologies. In contrast, long reads delivered a clear advantage in structural variant detection, surpassing short reads in both accuracy and sensitivity. The hybrid ICLR approach achieved intermediate performance, narrowing the gap between short and long read sequencing. Furthermore, long reads enhanced haplotype-phasing resolution, enabling the phasing of over 80% of the genome. Conclusions: These findings highlight the specific strengths and limitations of recent sequencing technologies, aiding the decision-making in future research projects, technological platforms development, and clinical applications.

5
Spatial profiling of CAR protein organization reveals in vivo remodeling during CAR-T therapy

Kashima, Y.; Makishima, K.; van Ooijen, H.; Franzen, L.; Petkov, S.; Nishikii, H.; Zenkoh, J.; Suzuki, A.; Branting, A.; Sakata-Yanagimoto, M.; Suzuki, Y.

2026-04-22 genomics 10.64898/2026.04.20.719384 medRxiv
Top 0.8%
4.9%

Chimeric antigen receptor (CAR) T cell therapy utilizes genetically engineered patient-derived T cells to target cancer cells. Despite its clinical successes in multiple cancer types, the underlying molecular mechanisms by which molecules on CAR-T cells and surrounding cells interact with other proteins and collectively determine treatment efficacy remain elusive. Most previous studies have relied on transcriptome profiling, which does not fully reflect protein-level organization and interactions. In this study, we developed an antibody-oligonucleotide conjugate targeting the FMC63 region of CAR and integrated it into molecular pixelation (MPX). This approach enabled profiling of the dynamics of CAR molecules on cell surfaces as well as their colocalization with other proteins at the single-cell level. By applying MPX to longitudinal samples from three patients undergoing CAR-T cell therapy, we characterized the dynamic changes in CAR-associated protein organization in both pre-infusion CAR products and post-infusion peripheral blood. While CAR protein abundance and polarization showed limited variation across clinical courses, remodeling of a CAR-centered co-localization network was observed over time, including different retentions of specific molecular associations between patients with different clinical outcomes. Although derived from a limited cohort, our study identifies insights from this methodological framework beyond those gained by conventional omics analyses and offers results of a systematic investigation to predict and enhance CAR therapeutic outcomes.
Key points:
- Molecular pixelation was applied for chimeric antigen receptor (CAR) profiling at single-molecule and single-cell resolutions.
- Protein and transcriptome analyses of the CAR molecule showed dynamic remodeling during CAR-T therapy in patients with non-Hodgkin lymphoma.

6
Reveal Principles of Codon Optimization via Machine Learning

Deng, F.; Li, H.; Sun, D.; Duan, G.; Sun, Z.; Xue, G.

2026-04-21 bioinformatics 10.64898/2026.04.16.718958 medRxiv
Top 0.8%
4.9%

High protein expression is usually desirable in industry and research, and codon optimization is widely used to achieve it. Codon optimization methods fall into two branches: classical methods, which build cost functions from empirical rules, and AI methods, which use neural networks to learn codon choice principles from endogenous genes. Here we develop one tool for each branch, OptimWiz 2.1 and OptimWiz 3.0, respectively. Fusion protein fluorescence assays indicate that both OptimWiz 2.1 and OptimWiz 3.0 outperform all the other commercially available codon optimization tools. Principles of codon optimization are revealed in the process of machine learning on both tools.

7
A Systems Pharmacology Model of Ageing Identifies Optimal Combination Therapies With Secondary Benefits on Weight Loss and Metabolic Health

Goryanin, I.; Damms, B.; Goryanin, I.

2026-04-23 pharmacology and therapeutics 10.64898/2026.04.22.26351392 medRxiv
Top 0.8%
4.9%

Background: Ageing is a systems-level biological process underlying the onset and progression of multiple chronic disorders. Rather than arising from a single pathway, age-related decline reflects interacting disturbances in metabolic regulation, inflammation, nutrient sensing, cellular stress responses, and tissue repair. However, GLP-1 receptor agonists, sodium-glucose cotransporter 2 inhibitors, metformin, and rapamycin are usually evaluated against disease-specific endpoints. Objective: To develop an SBML-compliant quantitative systems pharmacology model in which ageing is the primary pharmacological endpoint and to evaluate which combination therapy provides the greatest benefit for both metabolic and ageing-related outcomes. Methods: We developed a model comprising four layers: a metabolic/pharmacodynamic layer describing weight loss, HbA1c reduction, and nausea with tolerance; a drug layer capturing class-specific effects of GLP-1 agonists, sodium-glucose cotransporter 2 inhibitors, metformin, and rapamycin; an ageing layer representing damage accumulation, repair capacity, frailty, and biological age gap; and a biomarker layer generating trajectories and estimated glucose disposal rate. Calibration was staged across semaglutide clinical endpoints. Bayesian hierarchical meta-analysis, global sensitivity analysis, and practical identifiability analysis were used to assess robustness and interpretability. Results: The model reproduced semaglutide efficacy and tolerability dynamics and supported distinct drug-class profiles across metabolic and ageing axes. Rapamycin showed minimal glycaemic effect but emerged as a dominant driver of repair-related ageing outcomes. Combination simulations predicted two distinct optima: one favouring metabolic improvement and one favouring ageing-related benefit.
Conclusion: The model supports the view that metabolic and ageing optimization are mechanistically distinct objectives and that weight loss and glycaemic improvement alone may be insufficient surrogates for health span benefit.

8
Closed-Loop Multi-Objective Optimization for Receptor-Selective Cell-Penetrating Peptide Design

Yamahata, I.; Shimamura, T.; Hayashi, S.

2026-04-21 bioinformatics 10.64898/2026.04.16.718169 medRxiv
Top 1.0%
4.4%

Cell-penetrating peptides (CPPs) can deliver diverse cargos into cells. However, designing CPPs with receptor-selective interaction profiles remains difficult because interactions with individual cell-surface components cannot be tuned independently. Here, we developed a closed-loop in silico framework for receptor-selective CPP design, in which receptor interactions are formulated as explicit objectives in a multi-objective optimization problem. We first constructed a CPP-like candidate library using a sequence generative model fine-tuned on known CPPs. The framework then evaluated candidate peptides by receptor-wise docking, molecular dynamics simulations, and MM/GBSA to compute receptor-wise binding scores. These scores were used iteratively to propose subsequent candidates by multi-objective Bayesian optimization. Applied to a CXCR4/NRP1 design setting, the framework identified candidates with more favorable predicted interaction profiles, characterized by higher CXCR4 binding scores and lower NRP1 binding scores. We selected 10 peptides from the computationally identified candidates for cell-based imaging and found that 4 showed higher enrichment in CXCR4-positive regions than in NRP1-positive regions under the tested conditions. These results show that the proposed framework provides a practical in silico approach for designing CPPs with receptor-selective interaction profiles.

9
Bi-level diversity optimisation for representative protein panel selection

Ou, Z.; James, K.; Charnock, S.; Wipat, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719243 medRxiv
Top 1%
4.3%

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
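A minimal sketch of how a MaxMin objective and a MaxSum objective might be combined in a swap-based local search is given below. It assumes the two objectives are compared lexicographically (minimum pairwise distance first, total pairwise distance as tie-breaker) and uses a first-improvement swap rule. Both choices, and the names `panel_score` and `select_panel`, are assumptions for illustration rather than the authors' method.

```python
import itertools
import numpy as np

def panel_score(dist, panel):
    """Return (MaxMin term, MaxSum term) for a panel of at least two items."""
    idx = np.array(sorted(panel))
    sub = dist[np.ix_(idx, idx)]
    upper = sub[np.triu_indices(len(idx), k=1)]  # each pair counted once
    return (float(upper.min()), float(upper.sum()))

def select_panel(dist, k, seed=0):
    """Swap-based local search over a symmetric pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    rng = np.random.default_rng(seed)
    panel = set(rng.choice(n, size=k, replace=False).tolist())
    best = panel_score(dist, panel)
    improved = True
    while improved:
        improved = False
        outside = sorted(set(range(n)) - panel)
        for out, inn in itertools.product(sorted(panel), outside):
            candidate = (panel - {out}) | {inn}
            score = panel_score(dist, candidate)
            if score > best:  # lexicographic tuple comparison (assumed bi-level rule)
                panel, best, improved = candidate, score, True
                break         # restart the scan from the improved panel
    return sorted(panel), best
```

Because the search only accepts strict improvements over a finite set of panels, it terminates, but like any local search it may stop at a local optimum. The distance matrix can come from any sequence- or structure-based similarity measure.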

10
Benchmarking Generative Large Language Models for de novo Antibody Design and Agentic Evaluation

Hossain, D.; Abir, F. A.; Zhang, S.; Chen, J. Y.

2026-04-21 bioinformatics 10.64898/2026.04.18.716776 medRxiv
Top 1%
4.3%

Despite major advances in computational antibody engineering, no systematic comparison of modern open-source LLM backbone families for antibody sequence generation exists, nor is it known whether architectural differences matter at compact model scales. In this study, five compact transformer variants inspired by prominent open-source LLM families (Llama-4, Gemma-3, DeepSeek-V3, Mistral 7B, and NVIDIA Nemotron-3) were customized for de novo VH single-domain antibody (sdAb) design. All five models were pretrained from scratch on 15 million sequences from the Observed Antibody Space (OAS) database. Pretraining yielded uniformly high generative fidelity across architectures: sequence diversity 0.507-0.516 (CV=0.8%), uniqueness approaching 1.0, and novelty 0.925-0.977 (CV=2.2%). The models were subsequently fine-tuned on disease-stratified repertoires spanning SARS-CoV-2 (n=4,688), HIV (n=430), HER2 (n=22,778), and Ebola virus (n=2,868). Structural assessment of top-ranked candidates from these case studies via AlphaFold-2, Boltz-2, RoseTTAFold-2, and ESMFold produced mean pLDDT scores of 92.88 ± 1.54 to 93.77 ± 2.16, with no statistically significant inter-model differences (Kruskal-Wallis H=2.06, p>0.05; N=100). In this single-seed experiment, no difference between architectures was therefore detectable at this compressed scale, suggesting that generative capacity in this parameter regime is determined primarily by training data and model scale rather than by family-specific design elements.
Computational docking yielded predicted binding free energies of -36.34 to -65.60 kcal/mol; independent biological rigor validation through IMGT-defined CDR-H3 extraction, BLASTp novelty assessment, and NetMHCIIpan 4.3 MHC-II immunogenicity profiling collectively confirmed antigen-binding loop novelty (CDR-H3 identity 0-29% to closest database hits), germline-consistent humanness (77-90% VH germline content), and immunogenically silent antigen-binding surfaces with no strong MHC-II binders detected across CDR regions in any candidate. We further introduce a proof-of-concept agentic evaluation pipeline leveraging the Model Context Protocol (MCP) with Claude Sonnet 4.6, enabling automated structural profiling and candidate prioritization across disease targets.

11
Computational Drug Repurposing Targeting LuxS-Mediated Quorum Sensing in Fusobacterium nucleatum: A Virtual Screening and Molecular Dynamics Approach

Cedeno, K.; De Leon, D.; Chiari, M.

2026-04-21 microbiology 10.64898/2026.04.20.719701 medRxiv
Top 1%
4.1%

Fusobacterium nucleatum is an anaerobic bacterium strongly associated with the development and progression of colorectal cancer (CRC). Its pathogenic mechanisms involve the LuxS/AI-2 quorum sensing (QS) system, which regulates biofilm formation, virulence factor expression, and host immune evasion. Targeting LuxS represents a promising anti-virulence strategy that could disrupt bacterial communication without inducing selective pressure for antibiotic resistance. In this study, we employed a computational drug repurposing pipeline to identify FDA-approved drugs capable of inhibiting the LuxS enzyme in F. nucleatum. We performed structure-based virtual screening of 9,466 compounds from DrugBank using AutoDock Vina against the AlphaFold-predicted LuxS structure (UniProt: A0A133NIU3). From 1,082 initial hits (binding energy ≤ -7.0 kcal/mol), we applied ADMET filtering and composite scoring to select the top 5 candidates. Molecular dynamics simulations (10 ns each) using OpenMM with the AMBER14 force field confirmed the stability of all five protein-ligand complexes (RMSD < 2.0 Å). The most promising candidates include Tubocurarine (ΔG = -16.97 kcal/mol, RMSD = 1.87 Å), Docetaxel (ΔG = -13.22 kcal/mol, RMSD = 1.81 Å), Metyrosine (ΔG = -13.78 kcal/mol, RMSD = 1.97 Å), and Ergometrine (ΔG = -13.22 kcal/mol, RMSD = 1.92 Å). These results constitute an exploratory computational basis that requires subsequent experimental validation through in vitro and in vivo assays, and provide candidates for testing as anti-quorum sensing agents against F. nucleatum, with potential implications for CRC prevention and treatment.

12
Structure-aware graph attention based hierarchical transformer framework for drug-target binding affinity prediction

Kaira, V. S.; Kudari, Z. D.; P, S. S.; Bhat, R.; G, J.

2026-04-22 bioinformatics 10.64898/2026.04.19.719524 medRxiv
Top 1%
4.0%

Drug-target interaction prediction is significant in the hit identification phase of drug discovery, enabling the identification of potential drug candidates for downstream optimization. Traditional computational methods are limited in their ability to represent 3D structural data for both molecules and target proteins, which is required to capture the intricate protein-ligand interactions that regulate binding affinity. Here, we propose a graph transformer-based model (GTStrDTI) that combines an intragraph attention mechanism with cross-modal attention to enrich the representation of both the drug molecule and target protein. This approach comprehensively models both intramolecular structural features and intermolecular interactions, thereby enhancing binding affinity prediction performance. A thorough evaluation on benchmark datasets such as KIBA, DAVIS, and BindingDB_Kd shows that our approach surpasses the state-of-the-art methods under challenging target cold-start settings. Our analysis found that augmenting the model with a graph-based 3D structural representation of the protein target (C-alpha contact graphs from PDB with a threshold distance of 5 Å) and incorporating molecule adjacency information boosts predictive performance, thus contributing towards narrowing the gap between computational and experimental research.

13
Physics-Guided Deep Neural Networks: Correcting Physical Distortions in Protein Phase Separation Prediction

Wang, M.; Lu, T.; Song, Y.-h.; Li, y.

2026-04-21 cell biology 10.64898/2026.04.18.719364 medRxiv
Top 1%
3.7%

Background: In computational biology, embedding known physical laws into deep learning models to construct "Physics-Informed Neural Networks" (PINNs) is a mainstream paradigm for enhancing model interpretability and extrapolation capability. However, in complex multi-physics coupling problems, there is a risk of competitive imbalance between the physical term and the flexible artificial intelligence (AI) residual term, causing the model to degenerate into a "black-box" fit and lose the original purpose of being physics-driven. Methods: In this study, targeting the problem of predicting protein liquid-liquid phase separation (LLPS) behavior in response to environmental factors (temperature, salt concentration), we identified physical distortions, gradient vanishing, and numerical instability in the initial physics-AI hybrid model. Three core correction strategies were proposed: (1) Weight Allocation Logic Reconstruction: Force the physical trunk weight to 1.0 at the output layer, suppressing the AI residual term to the perturbation level of 0.05-0.1, ensuring physics dominance; (2) Robust Physics Formula Construction: Abandon the unstable power function and introduce a combination of Softplus and logarithmic functions to stably simulate the nonlinear effects of charge shielding; (3) Gain Compensation Alignment: Apply gain compensation to the weak signal branch (temperature) to ensure its effective participation in optimization. Results: The optimized model maintained a fitting accuracy of R² ≈ 0.62 on the test set, while physical consistency was significantly enhanced. The model successfully restored the monotonic increase in solubility with temperature characteristic of UCST-type phase diagrams and correctly captured the nonlinear charge shielding features in the salt concentration response.
The weights of key physical parameters (e.g., hydrophobic contribution w_h, net charge contribution w_ncpr) increased from <10⁻³ to the order of 10⁻², demonstrating the reactivation of the physical branch. Conclusions: The weight control, formula stabilization, and signal gain alignment strategies proposed in this study effectively address the classic problem of "AI hijacking" physics in physics-AI hybrid models. This work provides a universal solution for constructing biophysical predictive models that combine high fitting accuracy with strong physical interpretability.
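Two of the correction strategies described in this abstract, a physics trunk fixed at weight 1.0 with a small bounded residual, and a Softplus-plus-logarithm term in place of a power function, can be sketched as follows. The abstract does not give the functional forms, so everything here is hypothetical: the `tanh` bounding, the specific `log1p(softplus(x))` combination, and the function names are stand-ins chosen to illustrate the idea.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def screening_term(salt, w_salt):
    """Hypothetical smooth, monotone stand-in for a charge-screening term,
    combining Softplus and a logarithm (not the authors' formula)."""
    return w_salt * np.log1p(softplus(salt))

def hybrid_prediction(physics, residual, eps=0.1):
    """Physics output kept at weight 1.0; the learned residual is squashed
    so its contribution never exceeds eps in magnitude (assumed mechanism)."""
    return 1.0 * np.asarray(physics, dtype=float) + eps * np.tanh(residual)
```

With `eps` in the 0.05-0.1 range quoted in the abstract, the residual can only perturb the physics prediction, which is one way to keep the physical branch dominant.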

14
A multimodal exploration of circulating inflammatory markers in patients undergoing surgical intervention for lumbar disc herniation in selected hospitals of Sri Lanka

Aravinth, P.; Withanage, N. D.; Senadheera, B. M.; Pathirage, S.; Athiththan, S. P.; Perera, S. L.; Athiththan, L. V.

2026-04-23 orthopedics 10.64898/2026.04.21.26351426 medRxiv
Top 2%
3.4%

Background: Inflammatory markers play an important role in the pathophysiology of lumbar disc herniation (LDH). This study presents a comprehensive multi-assessment of the inflammatory landscape by combining quantification of serum inflammatory cytokines, their diagnostic performance, and their associations with radiological features, and by integrating the experimental findings into an in silico protein-protein interaction network. Methods: A multifaceted study design was utilized to quantify and compare the distribution of selected inflammatory cytokines in patients with LDH and control subjects. The diagnostic ability of these cytokines was assessed using receiver operating characteristic curve analysis. Cytokine values were correlated with selected radiological findings, including disc herniation subtypes (protrusion, extrusion, and sequestration), further categorized as contained and non-contained, using Spearman's rank correlation test. Additionally, computational analysis was performed to identify the central hubs and functionally enriched pathways. Results: In patients with LDH, IL-6 and IL-1β showed a statistically significant rise (IL-6: p < 0.001; IL-1β: p = 0.001), and IL-6 showed high diagnostic and discriminative power (AUC = 0.99; cut-off: 19.99 pg/mL). Further, IL-1β exhibited a positive correlation with non-contained disc herniation (extrusion and sequestration), while displaying a significant (p < 0.05) negative correlation with protrusion. In silico analysis identified IL-1β, IL-8, TNF-α, IL-6, IL-1, CSF2, CSF3, and IL-10 as central hubs, with IL-1β being the top-ranked hub in determining functionally enriched cytokine-cytokine receptor interaction. Conclusions: The study confirmed IL-6 as a powerful diagnostic marker for LDH, while IL-1β aids in distinguishing contained from non-contained disc herniation. Further, IL-1β was identified as the central hub, triggering functionally enriched pathways in the pathogenesis of LDH.

15
Benchmarking single-cell foundation models for real-world RNA-seq data integration

Han, S.; Sztanka-Toth, T.; Senel, E.; Elnaggar, A.; Patel, J.; Mansi, T.; Smirnov, D.; Greshock, J.; Javidi, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719314 medRxiv
Top 2%
3.2%

Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented rankings to summarize metric trade-offs and quantify performance consistency across datasets and evaluation settings. Our findings show that fine-tuning improved technical correction performance; among the foundation models, fine-tuned scGPT_CP performed best. However, the baseline scVI was the top overall performer, ranking first by our multi-metric Leximax ranking and achieving the highest Pareto Front-1 hit. Collectively, our study provides practical insights for adapting foundation models to real-world drug design and development.
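To make the "Pareto Front-1" notion concrete, the sketch below finds the methods that are not dominated by any other method when every metric is treated as higher-is-better. It shows the generic definition of a first Pareto front only. How the preprint counts "hits" across datasets and how its Leximax ranking is defined are not specified in the abstract and are not reproduced here; the function name `pareto_front` is hypothetical.

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated rows.

    scores: array of shape (n_methods, n_metrics), higher is better.
    Row j dominates row i if it is at least as good on every metric
    and strictly better on at least one.
    """
    s = np.asarray(scores, dtype=float)
    front = []
    for i in range(len(s)):
        dominated = any(
            np.all(s[j] >= s[i]) and np.any(s[j] > s[i])
            for j in range(len(s)) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

One plausible use, stated as an assumption, is to run this per dataset and count how often each method appears on the front.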

16
CD8scape: an accessible, command-line tool for predicting viral escape from the CD8+ T cell response

Smith, E. W.; Hughes, J.; Robertson, D. L.; Illingworth, C.

2026-04-22 bioinformatics 10.64898/2026.04.20.719634 medRxiv
Top 2%
2.7%

The CD8+ T cell response is a critical component of antiviral immunity, particularly in hosts who are immunocompromised or undergoing B cell-depleting therapy, such as rituximab. As viral evolution can lead to escape from CD8+ T cell recognition, tools that predict such escape are increasingly relevant. Here, we present CD8scape, an accessible command-line tool designed to predict viral escape from the CD8+ T cell response based on within-host sequence variation and HLA class I genotype. CD8scape is primarily a Julia wrapper for NetMHCpan v4.2, a neural network-based predictor trained on mass spectrometry-derived peptide presentation data. CD8scape integrates variant data and viral reading frames to identify all overlapping 8-11mer peptides at variant sites in both ancestral and derived states. These peptides are evaluated using NetMHCpan, which outputs eluted ligand (EL) scores as allele-specific percentile ranks to account for differences in MHC binding fastidiousness, and these are passed back to CD8scape itself. For each variant, the best-ranking peptide across all alleles is identified, and a harmonic mean is used to summarize presentation likelihood across the host's HLA genotype. A fold-change between ancestral and derived harmonic means quantifies the likelihood of immune escape, with values >1 indicating reduced predicted presentation, and therefore a potential escape from the CD8+ T cell response. The fold-change is then log2-transformed so that the metric is symmetric around 0, with positive values representing predicted escape. CD8scape can operate with known HLA genotypes or a representative HLA supertype panel for generalizable predictions. We demonstrate our method by application to within-host SARS-CoV-2 evolution in a rituximab-treated patient and discuss its implications for population-level CD8+ T cell escape.
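The scoring steps described above can be sketched in a few lines of Python. This is not CD8scape's code (the tool is written in Julia) and it does not call NetMHCpan. It assumes that the best (lowest) percentile rank is taken per allele, that the harmonic mean is taken across alleles, and that the fold-change is derived divided by ancestral. The function names and all rank values are hypothetical.

```python
import math
from statistics import harmonic_mean


def peptides_overlapping(protein, pos, min_len=8, max_len=11):
    """All peptides of length min_len..max_len covering 0-based residue pos."""
    peptides = []
    for k in range(min_len, max_len + 1):
        first = max(0, pos - k + 1)          # earliest start that still covers pos
        last = min(pos, len(protein) - k)    # latest start that fits in the protein
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides


def escape_score(ancestral_ranks, derived_ranks):
    """log2 of (derived / ancestral) harmonic-mean percentile rank.

    Each argument maps an HLA allele to the percentile ranks of its peptides;
    lower rank means stronger predicted presentation."""
    hm_anc = harmonic_mean([min(r) for r in ancestral_ranks.values()])
    hm_der = harmonic_mean([min(r) for r in derived_ranks.values()])
    return math.log2(hm_der / hm_anc)


# Hypothetical percentile ranks for two alleles (not real NetMHCpan output).
ancestral = {"HLA-A*02:01": [0.5, 2.0], "HLA-B*07:02": [1.0, 4.0]}
derived = {"HLA-A*02:01": [2.0, 8.0], "HLA-B*07:02": [4.0, 6.0]}

score = escape_score(ancestral, derived)
peptides = peptides_overlapping("ACDEFGHIKLMNPQRSTVWYACDEFGHIKL", pos=15)
print(score, len(peptides))
```

In this toy case the derived variant has worse percentile ranks for both alleles, so the score is positive, which under the stated assumptions corresponds to predicted escape.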

17
Pan1c: a pipeline to easily build chromosome-level pangenome graphs

Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.

2026-04-21 bioinformatics 10.64898/2026.04.17.719212 medRxiv
Top 3%
2.3%

Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species make it possible to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variation in a graph reference called a pangenome graph. Because building and analysing a pangenome graph is still complex, and the related software packages are still under development, there is a pressing need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we show the value of Pan1C for assembly and graph validation as well as for performing primary analyses.

18
Foundation cell segmentation models performance on live microscopy and spatial-omics data

Miao, Y.; Surguladze, N.; Lerner, J.; Poysungnoen, K.; Ariano, K.; Li, Y.; Zhu, Y.; Van Batavia, K.; Jepson, J.; Van De Klashorst, J.; Ni, B. Y. X.; Armstrong, A.; Rahman, R.; Horstmeyer, R.; Hickey, J. W.

2026-04-21 bioinformatics 10.64898/2026.04.18.719315 medRxiv
Top 3%
2.1%

Accurate cell segmentation is an essential step for quantitative analysis of biological imaging data. Recent advances in deep learning have led to the development of generalist segmentation models that perform robustly across multiple imaging modalities, including label-free phase contrast, fluorescence cell culture, and multiplexed fluorescence tissue imaging. However, systematic comparisons of these models at the level of downstream biological analysis remain limited. To address this gap, we evaluated several recent segmentation models, including Cellpose cyto3, Cellpose-SAM, μSAM, and CellSAM, on phase contrast and fluorescence cell culture images. In addition, Mesmer and InstanSeg were included for benchmarking on multiplexed fluorescence tissue images generated using CO-Detection by IndEXing (CODEX). We found that Cellpose-SAM achieved strong performance on phase contrast images, while SAM-based models consistently performed well on fluorescence cell culture data. In contrast, no single model consistently outperformed others on CODEX datasets. Instead, each model exhibited distinct strengths and limitations, which led to differences in downstream analyses, including clustering and cell type identification. Together, our study emphasizes the importance of selecting segmentation models based on dataset characteristics and analytical goals, rather than relying on a single universal approach.

19
Evaluating Expert Specialization in Mixture-of-Experts Antibody Language Models

Burbach, S. M.; Spandau, S.; Hurtado, J.; Briney, B.

2026-04-22 immunology 10.64898/2026.04.17.719246 medRxiv
Top 3%
2.1%

Antibody language models (AbLMs) show an impressive aptitude for learning antibody features, but tend to struggle to learn the highly diverse, non-templated regions of antibodies. Existing AbLMs use dense architectures, where all model parameters attend to each amino acid token. We hypothesized that the modular nature of antibodies could benefit from a sparse mixture-of-experts (MoE) architecture, allowing specific parameters (referred to as experts) to specialize in distinct antibody features. While MoE architectures are widely adopted and optimized in natural language processing domains, they are less common in biological modeling. To this end, we assess existing MoE routing strategies and find that token-choice routing strategies outperform expert-choice routing, presumably due to their specialization in CDRH3 residues. We further optimized the token-choice router for AbLMs by minimizing the routing of padding tokens to enable pre-training with varying sequence lengths. Finally, we show that a large-scale baseline antibody language model with a Top-2 MoE architecture (BALM-MoE), trained on a mixture of unpaired and paired antibody sequences, outperforms its dense counterpart with the same number of active parameters.
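For readers unfamiliar with token-choice routing, the toy sketch below shows the general Top-2 idea in plain Python: each token picks its two highest-scoring experts and the two gate weights are renormalized. It is not the BALM-MoE router. Skipping padding tokens outright is a simplifying assumption standing in for the abstract's "minimizing the routing of padding tokens", and the function names and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_token_choice(router_logits, is_padding):
    """Return, per token, a list of (expert_index, gate_weight) pairs.

    Each non-padding token chooses its two most probable experts, and the
    two gate weights are renormalized to sum to 1. Padding tokens get no
    expert (an assumed simplification)."""
    assignments = []
    for logits, pad in zip(router_logits, is_padding):
        if pad:
            assignments.append([])
            continue
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]
        norm = sum(probs[e] for e in top)
        assignments.append([(e, probs[e] / norm) for e in top])
    return assignments


# Hypothetical router logits for three tokens over four experts.
router_logits = [
    [2.0, 1.0, 0.0, -1.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],   # padding position
]
is_padding = [False, False, True]

routes = top2_token_choice(router_logits, is_padding)
print(routes)
```

In expert-choice routing the selection runs the other way: each expert picks a fixed number of tokens, so some tokens may be processed by no expert at all.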

20
Comparative benchmarking of single-cell transcriptomes and immune repertoires across technologies

King, C.; Iqbal, M.; Shokati, E.; Man Ying Li, C.; Li, R.; Tomita, Y.; Smith, E.; Kawecka, J. A.; Wang, S.; Fenix, K.

2026-04-22 bioinformatics 10.64898/2026.04.20.719117 medRxiv
Top 3%
2.1%

Immune receptor profiling enables tracking of individual T or B cell clones across time and tissues, providing insight into immune responses, cancer, and autoimmunity. When combined with single-cell transcriptomics, it links clonotype identity to cellular function, revealing the diversity and dynamics of immune cell populations. This study presents a head-to-head benchmarking comparison of two single-cell immune profiling technologies: droplet-based microfluidics from 10x Genomics (10x) and combinatorial barcoding from Parse Biosciences (Parse). Using matched human samples from PBMCs, the analysis evaluates performance across transcriptomic and T cell immune receptor features to assess data quality, reproducibility, and chain-specific recovery. The findings provide a framework for interpreting single-cell immune profiling platforms and emphasize the importance of accounting for technology-specific biases in bioinformatic analyses.